Virtual YouTubers, or VTubers, are online entertainers, who are typically Japanese-speaking YouTubers or live streamers. They use avatars created with programs such as Live2D, portraying characters designed by online artists. VTubers are associated with Japanese popular culture and aesthetics, such as anime and manga, and moe anthropomorphism with human or non-human traits. Some VTubers are anthropomorphic, non-human characters such as animals.
A VTuber's avatar is typically animated using a webcam and software, which captures the streamer's motions, expressions, and mouth movements, and maps them to a two- or three-dimensional model.
This project will provide the analysis of Vtubers to familiarize people with the brand new branch of the entertainment industry

The data consists of several datasets: general information about youtube channel, superchat(youtube donation) stats and 3 files, consisting information, obtained by scraping websites (check my web scraping file for more detailed explanation)
Number of observations - 909
channelId - Id of an youtube channel, wich is used to access channel page through url.
name - Name of a channel.
englishName - English name of channel. This column is used because VTubers' originates in Japan.
affiliation - membership of channel to a certain agency
subscriptionCount - Number of subscribers of the channel
videoCount - Number of videos on the channel
photo - Url of channel's profile picture
Number of observations - 3353
channelId - Id of an youtube channel, wich is used to access channel page through url.
period - Time period of data acquisition
superChats - Number of superchats
uniqueSuperChatters - Number of unique super chatters
totalSC - Total amount of superchats (JPY)
averageSC - Average amount of superchats (JPY)
totalMessageLength - Total superchat message length
averageMessageLength - Average superchat message length
mostFrequentCurrency - Most frequent donated currency
mostFrequentColor - Most frequent donation color. Used, because youtube superchats change their color depending on donated sum. Picture of significance table is related.

Number of observations - 127,645
amount - Amount of donated money
currency - Description of currency type
significance - Superchat significance
channelId - Id of an youtube channel, wich is used to access channel page through url.
Number of observations - 15
name - YouTube channel name
revenue - Total amount of superchatted money of all time
Number of observations - 1997
name - Youtube channel name
registration date - Date of channel registration on the YouTube
Vtuber - Vtuber name
Dates - list of dates, continuing up to the november 2021.
For this project, analysis and visualization consists of 5 parts:
Now we need to prepare and clean datasets:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
channels = pd.read_csv('Datasets/channels.csv')
vshojo_subs = pd.read_csv('Datasets/vshojo.csv')
top_revenue = pd.read_csv('Datasets/top_revenue.csv')
superchat_stats = pd.read_csv('Datasets/superchat_stats.csv')
superchat03 = pd.read_csv('Datasets/superchats_2021-03.csv')
superchat04 = pd.read_csv('Datasets/superchats_2021-04.csv')
superchat05 = pd.read_csv('Datasets/superchats_2021-05.csv')
superchat06 = pd.read_csv('Datasets/superchats_2021-06.csv')
superchat07 = pd.read_csv('Datasets/superchats_2021-07.csv')
superchat08 = pd.read_csv('Datasets/superchats_2021-08.csv')
superchat09 = pd.read_csv('Datasets/superchats_2021-09.csv')
superchat10 = pd.read_csv('Datasets/superchats_2021-10.csv')
vtubers_date = pd.read_csv('Datasets/vtubers_date.csv')
#drop unnecessary columns
channels = channels.drop(columns = {'photo'})
channels.head()
#Convert string to the actual number and change names to match with initial dataset
vshojo_subs['subs'] = vshojo_subs['subs'].apply(lambda x: float(x.split()[:1][0].replace(',', '.')) * 1000)
vshojo_subs.name.replace({'HimeHajime': 'Hime Hajime', 'Apricot': 'Froot', 'ironmouse' : 'Ironmouse', 'Nyanners': 'Nyatasha Nyanners'}, inplace = True)
vshojo_subs.head()
# Convert revenue to amount of United States' Dollars
top_revenue['revenue'] = top_revenue.revenue.apply(lambda x: float(x.split()[-1].replace(',', ''))/433,24)
top_revenue.head()
#drop unnecessary columns and null rows
vtubers_date = vtubers_date.drop(columns = {'subs'})
vtubers_date = vtubers_date[vtubers_date['registration date'].notna()]
vtubers_date.head()
#there appeared to be 236 null rows
# merge datasets
superchat03 = pd.merge(channels[['channelId', 'name', 'affiliation']], superchat03, on = 'channelId')
superchat04 = pd.merge(channels[['channelId', 'name', 'affiliation']], superchat04, on = 'channelId')
superchat05 = pd.merge(channels[['channelId', 'name', 'affiliation']], superchat05, on = 'channelId')
superchat06 = pd.merge(channels[['channelId', 'name', 'affiliation']], superchat06, on = 'channelId')
superchat07 = pd.merge(channels[['channelId', 'name', 'affiliation']], superchat07, on = 'channelId')
superchat08 = pd.merge(channels[['channelId', 'name', 'affiliation']], superchat08, on = 'channelId')
superchat09 = pd.merge(channels[['channelId', 'name', 'affiliation']], superchat09, on = 'channelId')
superchat10 = pd.merge(channels[['channelId', 'name', 'affiliation']], superchat10, on = 'channelId')
First of all, let's import our libraries, set plots and create formatting function to give our plots an appropriate appearance when it deals with large numbers.
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
from ipywidgets import interact
import plotly.graph_objects as go
import plotly.express as px
from sklearn import linear_model
plt.rc('font', size = 16)
plt.rc('legend', fontsize = 16)
plt.rc('figure', figsize = (16, 9))
plt.style.use('fivethirtyeight')
def reformat_values(val, pos):
if val >= 1000000000:
val = round(val/1000000000)
new_format = '{:}B'.format(val)
elif val >= 1000000:
val = round(val/1000000)
new_format = '{:}M'.format(val)
elif val >= 1000:
val = round(val/1000)
new_format = '{:}K'.format(val)
elif val < 1000:
new_format = round(val)
else:
new_format = val
return new_format
def reformat_count(val):
if val >= 1000000:
val = round(val/1000000, 2)
new_format = '{:}M'.format(val)
elif val >= 1000:
val = round(val/1000, 2)
new_format = '{:}K'.format(val)
elif val < 1000:
new_format = round(val, 2)
else:
new_format = val
return str(new_format)
def reformat_list(val_list, sign):
new_val_list = []
for val in val_list:
if val >= 1000000:
val = round(val/1000000, 2)
val = sign + str(val)
new_val = '{:}M'.format(val)
new_val_list.append(str(new_val))
elif val >= 1000:
val = sign + str(val)
val = round(val/1000, 2)
new_val = '{:}K'.format(val)
new_val_list.append(str(new_val))
elif val < 1000:
val = sign + str(val)
new_val = round(val, 2)
new_val_list.append(str(new_val))
else:
val = sign + str(val)
new_val_list.append(str(val))
return new_val_list
def currency_exchanger(amount, init_cur):
c = CurrencyRates()
return round(c.convert(init_cur, 'USD', amount), 2)
Before building a graph, let's replace total subscription stats of VShojo agency. We need it because data from 'channels' dataframe was obtained from YouTube only, but VShojo is actually a twitch-based agency and that's why I decided to get VShojo statistics directly from twitch in order to fix their stats in this dataset. See more about data acquisition process in Web Scraping file
popularity = channels.groupby('affiliation').sum().drop('videoCount', axis = 1)
popularity = popularity.replace(popularity[popularity.index == 'VShojo']['subscriptionCount'][0], vshojo_subs.sum()[1])
popularity = popularity.sort_values('subscriptionCount', ascending=False).head()
popularity
#value for plotting average line
mean_subs = popularity.mean()[0]
#Create condition to discolor bars below average line
above_aver = ['grey' if (x < mean_subs) else '#792383' for x in popularity['subscriptionCount']]
bars = plt.bar(popularity.index, popularity['subscriptionCount'], color = above_aver, edgecolor = 'black')
#Write bars' value inside of them
for i, v in enumerate(popularity['subscriptionCount']):
plt.text(i-.225,
v/2.5,
reformat_count(v),
fontsize=25,
color='white')
#plot line with average subscribers count and write a label to it
plt.axhline(y = mean_subs, linestyle = '--', color = '#C1C1C1')
plt.text(
x = 4.25,
y=mean_subs,
s="Average subscribers",
backgroundcolor="grey",
color = 'white',
fontsize=14)
#Format y-axis
plt.gca().yaxis.set_major_formatter(tick.FuncFormatter(reformat_values))
#Titles and labels
plt.xlabel('Agencies')
plt.ylabel('Number of subscribers')
plt.title('Figure 1.1 The most popular VTuber agencies')
#Color bars to give more attractive view
bars[0].set_color('#27c7ff')
bars[0].set_edgecolor('black')
bars[1].set_color('#294a72')
#fix spacing
plt.tight_layout()
plt.show()
Looking at the figure 1.1, we can observe huge domination of two giant agencies - Hololive and Nijisanji. But independent VTubers are also dominating over other agencies. Are streamers without an agency more popular than the rest? Let's dive into these three affiliations, compare their size and most popular talents
channels.groupby('affiliation').count()[['channelId']].sort_values('channelId', ascending = False).head(3)
Looking at the stats above we can easily answer question about independent VTubers - their popularity is due to their large number of streamers, which takes about 30% of whole dataset
#sort dataset to get only desired groups
best_talents = channels[(channels.affiliation == 'Hololive') | (channels.affiliation == 'Nijisanji')\
| (channels.affiliation == 'Independents')]
#create interactive function that will return certain plot
def best_rate(x):
#Divide by color
if x == 'Hololive':
color = '#27c7ff'
elif x == 'Nijisanji':
color = '#294a72'
else:
color = '#792383'
plt.figure(figsize = (18, 15))
#take two columns with subscribers and channel name and then convert it to the list
agency_best_talents = best_talents[best_talents.affiliation == x][['englishName', 'subscriptionCount']]\
.sort_values('subscriptionCount', ascending=False).head(20).sort_values('subscriptionCount')
talent_name = list(agency_best_talents['englishName'])
talent_rate = list(agency_best_talents['subscriptionCount'])
#create a horizontal bar plot with lists above
plt.barh(talent_name, talent_rate, color = color, edgecolor = 'black')
#assign bar values inside of them to show how many subs are there
for i, v in enumerate(talent_rate):
plt.text(v/100, i-.15, reformat_count(v), color='white', fontsize = 14, fontweight = 'demibold')
#reformat x-axis to make more comfortable view
plt.gca().xaxis.set_major_formatter(tick.FuncFormatter(reformat_values))
#return plot
plt.title('Figure 1.2 Top 20 most subscribed VTubers of the ' + x + ' agency')
plt.tight_layout()
return plt.show()
#activate interactive function
interact(best_rate, x = best_talents.affiliation.unique())
Despite having much less talents than their closest rival, Hololive members appear to be more recognizable and more popular, than ones in the Nijisanji company. Taking this into account, it can be assumed that Hololive is the most popular agency in the world
Let's analyze which virtual streaming agency do generate more content. Due to need in affiliation, we won't count independent VTubers.
#prepare data to be plotted
treemap_data = channels.dropna().groupby('englishName').agg({
'subscriptionCount': 'first',
'videoCount': 'first',
'affiliation': 'first',
'group': 'first',
'name': 'first'
}).reset_index()
treemap_data
px.treemap(treemap_data, path = ['affiliation', 'group', 'name'], values = 'videoCount',
width=1100, height=720, title = 'Figure 2 Treemap of generated content distribution at the platform')
treemap_data['videoCount'].sum()
Once again, we can see dominance of Hololive and Nijisanji. They take about 3/4th of whole content field, but it's only due to their vast size. However, we can see that there are pretty much enough content to watch - about 110k videos
YouTube does provide a superchat function. Superchat is a way for viewers to support their favorite YouTubers by sending them money and it's actually one of the most efficient way to earn money from YouTube.
We now know that VTubers are pretty popular and there is a massive amount of content to watch. But how about their income? Let's analyze it now.
In order to answer question above, I obtained data from Playboard, which contains data about most superchatted (donated) channels EVER
color = ['pink' if i in list(channels['name']) else 'grey' for i in top_revenue['name']]
fig = go.Figure(go.Bar( x = top_revenue['name'],
y = top_revenue['revenue'],
marker={'color': color},
text = reformat_list(top_revenue['revenue'], '$'),
textposition = 'inside'))
fig.update_layout(title_text='Figure 3.1 Top 15 most superchatted YouTube channels ever')
Figure above shows that 12 out of 15 most superchatted YouTubers are VTubers and their maximal income is about 3 million dollars. Let's dive into superchats and see how many superchats and money do VTubers gain
frequent_currencies = superchat06.groupby('currency').count().sort_values('amount', ascending = False).head().index
superchat_list = [superchat03, superchat04, superchat05, superchat06, superchat07, superchat08, superchat09, superchat10]
def money_line(Affiliation, Currency):
month = ['Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct']
money = []
for i in superchat_list:
money.append(i[(i['affiliation'] == Affiliation) & (i['currency'] == Currency)]['amount'].sum())
plt.plot(month, money, color = '#27c7ff', marker = 'o')
plt.gca().yaxis.set_major_formatter(tick.FuncFormatter(reformat_values))
plt.xlabel('Months')
plt.ylabel('Amount of superchatted money')
plt.annotate('*data was obtained in period of 03.2021 - 10.2021', (0,0), (700,-80),
fontsize=15, xycoords='axes fraction', textcoords='offset points')
for i,j in zip(month,money):
plt.annotate((reformat_values(j, i)), xy = (i,j*1.02))
plt.title('Figure 3. 2 Amount of supperchatted money to ' + Affiliation +' VTuber group (' + Currency + ')')
return plt.show()
def money_bar(Currency, TimePeriod):
month = ['Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct']
period = month.index(TimePeriod)
sc_distrib = superchat_list[period][superchat_list[period].currency\
== Currency].groupby(['affiliation']).sum().sort_values('amount', ascending = False).head()
vtuber_groups = list(sc_distrib.index)
money = list(sc_distrib['amount'])
plt.bar(vtuber_groups, money, color = '#294a72', edgecolor = 'black')
plt.gca().yaxis.set_major_formatter(tick.FuncFormatter(reformat_values))
for i, v in enumerate(money):
plt.text(i-.175,
v/5,
reformat_count(v),
fontsize=15,
color='white', fontweight = 'semibold')
plt.annotate('*data was obtained in period of 03.2021 - 10.2021', (0,0), (700,-80),
fontsize=15, xycoords='axes fraction', textcoords='offset points')
plt.xlabel('Agencies')
plt.ylabel('Amount of superchatted money')
plt.title('Figure 3.2 Amount of supperchat money in ' + TimePeriod +' (' + Currency + ')')
return plt.show()
def choose_interpret(PlotType):
if PlotType == 'bar':
interact(money_bar, Currency = frequent_currencies, TimePeriod = ['Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct'])
else:
interact(money_line, Affiliation = superchat03.affiliation.unique(), Currency = frequent_currencies)
interact(choose_interpret, PlotType = ['bar', 'plot'])
Now we are able to see income of top VTubers agency, splitted by months. Exploring a graph, we can assume that VTuber industry is pretty profitable, but most of overall profit belongs to a pair of giant agencies - Hololive and Nijisanji. Independent VTubers gain much less money even having much more talents
We did count amount of superchats, but when are the most profitable days to stream?
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
for i in superchat_list:
i['timestamp'] = pd.to_datetime(i['timestamp'])
day_mapping = {
0: 'Monday',
1: 'Tuesday',
2: 'Wednesday',
3: 'Thursday',
4: 'Friday',
5: 'Saturday',
6: 'Sunday'
}
sc_time = 0
for i in superchat_list:
i['day'] = i.timestamp.dt.dayofweek.map(day_mapping)
sc_time += i.groupby(['day', 'significance']).count().reindex(level='day', labels=days).unstack('significance')['channelId']
sc_time
fig = px.bar(sc_time, x = sc_time.index, y = sc_time.columns)
fig.update_layout(title_text='Figure 4 Distribution of superchats quantity by days of week')
From the chart 4 we can assume that days with the most number superchats are weekends, which is actually not surprising, considering that people are not busy with work and this days are so-called 'superchat reading' days during which vtubers read messages in super chats and interact with audience the most
Now we know that VTuber is beneficial profession and they are pretty popular, but when exactly did they become popular? Let's take a look at time series of subscribers rise of Hololive agency, because we already know that Hololive is the most popular and most profitable agency.
Data was obtained from Mason Palmer
holo_sub = pd.read_excel('Datasets/holo_sub_count.xlsx')
holo_sub.head()
time_period = list(pd.to_datetime(holo_sub.columns[3:]).to_period('M'))
for i in range(len(time_period)):
time_period[i] = str(time_period[i])
holo_time = holo_sub.groupby('Generation').sum().transpose().reset_index()
holo_time['index'] = time_period
holo_time.head()
mean_subs = holo_time.groupby('index').mean().transpose().sum().transpose()
mean_diff = [0]
for i in range(1, len(mean_subs)):
mean_diff.append(mean_subs[i] - mean_subs[i-1])
max_mean_diff_ind = mean_diff.index(max(mean_diff))
colors = ['#ff0000', '#ff8000', '#ffff00', '#80ff00', '#00ff00', '#00ffff', '#0080ff', '#0000ff', '#8000ff', '#ff00ff', '#ff0080', '#00ff80']
ind = 0
for i in holo_time.columns[1:]:
plt.plot(time_period, holo_time[i], label = i, color = colors[ind])
ind+=1
plt.gca().yaxis.set_major_formatter(tick.FuncFormatter(reformat_values))
plt.xticks(time_period[::4])
plt.axvline(x = time_period[max_mean_diff_ind], linestyle = '--', color = '#C1C1C1', linewidth = 3)
plt.text(
x = time_period[max_mean_diff_ind],
y=12*10**6,
s="Greatest slope",
backgroundcolor="grey",
color = 'white',
fontsize=14)
plt.ylabel('Number of subscribers')
plt.xlabel('Time Period')
plt.title('Figure 5 Hololive Subscribers change rate by time')
plt.legend(prop={"size":10})
Here we can see that followers rate increases dramatically in the beginning of 2021 year. There could be be several reasons, such as debut of english speaking branch or lack of animated content due to industrial restrictions because of COVID-19.
Now let's take a look about number of appearance of new VTubers per each year. Data was obtained from youtube and userlocal, that does track vast amount of VTubers.
def reg_rate_change_format1(x):
x_list = x.split(',')
x_list[0], x_list[1] = x_list[1], x_list[0]
x_list[0] = x_list[0][-2:]
return ', '.join(x_list)
vtubers_date['registration date'] = vtubers_date['registration date'].apply(reg_rate_change_format1)
vtubers_date
vtubers_reg_rate = vtubers_date.groupby('registration date').count()
year_ind = list(vtubers_reg_rate['name']).index(vtubers_reg_rate['name'].max())
plt.figure(figsize = (20, 15))
plt.plot(vtubers_reg_rate.index, vtubers_reg_rate['name'], color = '#792383')
plt.xticks(vtubers_reg_rate.index[::3])
plt.axvline(vtubers_reg_rate.index[year_ind], color = '#C1C1C1', linestyle = '--')
plt.text(
x = vtubers_reg_rate.index[year_ind],
y=160,
s="Highest number of registrations",
backgroundcolor="grey",
color = 'white',
fontsize=14)
plt.xlabel('Time Period')
plt.ylabel('Number of VTuber registrations')
plt.title('Figure 5.2 Number of VTubers registrations per year quarters')
Here we do have some interesting statistics: On the figure 5.2 we can see two spikes, highest of which falls on the second half of 2018. Second spike's timestamp(end of 2020 - beginning of 2021) matches with Hololive's subscribers growth and COVID-19 quarantine measures from previous chart
Based on the analysis we could conclude several assumptions about VTubers industry